getwd()
setwd('~/desktop/tools/r/WhiteWines')
chooseCRANmirror(graphics = FALSE, ind = 1)
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)
White_Wines
Loading Dataset
Univariate Analysis:
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
The output above shows the structure of our wine dataset, which helps us prepare for visualization.
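As a minimal sketch of what that structure looks like in code, the first five rows printed above can be typed into a small data frame and inspected the same way. The object name `ww_head` is an assumption made for illustration, and the values are copied as printed, so some may be rounded for display.

```r
# First five rows, copied from the str() output above (values as printed,
# so they may be rounded). The name `ww_head` is illustrative only.
ww_head <- data.frame(
  X                    = 1:5,
  fixed.acidity        = c(7, 6.3, 8.1, 7.2, 7.2),
  volatile.acidity     = c(0.27, 0.30, 0.28, 0.23, 0.23),
  citric.acid          = c(0.36, 0.34, 0.40, 0.32, 0.32),
  residual.sugar       = c(20.7, 1.6, 6.9, 8.5, 8.5),
  chlorides            = c(0.045, 0.049, 0.050, 0.058, 0.058),
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186),
  density              = c(1.001, 0.994, 0.995, 0.996, 0.996),
  pH                   = c(3.00, 3.30, 3.26, 3.19, 3.19),
  sulphates            = c(0.45, 0.49, 0.44, 0.40, 0.40),
  alcohol              = c(8.8, 9.5, 10.1, 9.9, 9.9),
  quality              = c(6L, 6L, 6L, 6L, 6L)
)

str(ww_head)                          # column names and types
col_types <- sapply(ww_head, class)   # named character vector of classes
```

Only `X` and `quality` are integers; every chemical measurement is numeric, which is why histograms and box plots are the natural first plots.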
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
This table shows how many wines fall at each quality score, which ranges from 3 to 9.
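The counts in the table are enough to rebuild the quality column and check how lopsided it is. The sketch below uses only the numbers printed above; the variable names are illustrative.

```r
# Counts per quality score, copied from the table() output above.
quality_counts <- c(`3` = 20, `4` = 163, `5` = 1457, `6` = 2198,
                    `7` = 880, `8` = 175, `9` = 5)

# Rebuild the full quality vector from the counts.
quality <- rep(as.integer(names(quality_counts)), times = quality_counts)

n_wines      <- length(quality)             # total number of wines
mean_quality <- mean(quality)               # arithmetic mean of the scores
share        <- prop.table(table(quality))  # fraction of wines at each score

round(100 * share, 1)  # percentage of wines at each score
```

Working from the counts makes it easy to see that the extreme scores (3 and 9) hold only a handful of wines, which limits what can be said about them later.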
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
This plot shows the distribution of quality. A score of 6 is by far the most common, with 2198 of the 4898 wines.

This frequency polygon shows the distribution of pH. The values center around 3.15, close to the mean of 3.188 in the summary above.

Fixed acidity is concentrated between about 6 and 8, and most wines fall in this range.


These plots show the distributions of citric acid and volatile acidity. Citric acid centers around 0.3 (mean 0.3342 in the summary above), while volatile acidity centers slightly lower, around 0.25 to 0.3 (mean 0.2782).

From this we can see that density is centered just below 1 (mean 0.9940). The tallest bin holds around 900 wines.

From this we can see that the sulphates counts rise and then fall between about 0.4 and 0.6, where most wines sit. We do not know yet whether sulphates help or hurt quality.

In this graph we can see that residual sugar is heavily concentrated at the low end of its range (the summary above shows a minimum of 0.6 and a first quartile of 1.7). Most wines, across quality scores, have sugar values in this region.

The main body of the chlorides distribution looks roughly normal in the plot, but there are many outliers: the maximum of 0.346 is far above the third quartile of 0.050.
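To make "outlier" concrete, the sketch below applies the common 1.5 times IQR rule to the chlorides quartiles printed in the summary. This is the rule box plots usually draw, but treating it as the definition of an outlier here is an assumption, not something the original analysis states.

```r
# Quartiles and extremes for chlorides, copied from the summary() output above.
q1            <- 0.036
q3            <- 0.050
chlorides_min <- 0.009
chlorides_max <- 0.346

iqr         <- q3 - q1          # interquartile range
lower_fence <- q1 - 1.5 * iqr   # values below this count as low outliers
upper_fence <- q3 + 1.5 * iqr   # values above this count as high outliers

c(lower = lower_fence, upper = upper_fence)
chlorides_max > upper_fence     # TRUE if the maximum lies beyond the upper fence
chlorides_min < lower_fence     # TRUE if the minimum lies beyond the lower fence
```

With these quartiles the fences are narrow, so both the minimum and the maximum fall outside them, and the long upper tail accounts for most of the outliers.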

Alcohol counts rise and fall at certain values rather than following a single smooth peak. Further analysis is needed to see which regions drive these increases and decreases.

Here we compare free sulfur dioxide with total sulfur dioxide. Free sulfur dioxide is concentrated in a much lower and narrower range (mean 35.31, maximum 289) than total sulfur dioxide (mean 138.4, maximum 440).
Univariate Summary:
So far we have looked at each variable in the dataset on its own. The structure and summary output show what the data contains. Since quality is our subject, we started with its histogram, which is centered on 6 (median 6, mean 5.878). From the frequency polygons we can see that fixed acidity mostly lies between 6 and 8 and pH between 3 and 3.3. Citric acid has a mean of about 0.3. The box plots and histograms of the remaining variables show their ranges, and the box plots also reveal the outliers present in the data.
Bivariate Plot Analysis

##
## Pearson's product-moment correlation
##
## data: ww$fixed.acidity and ww$quality
## t = -8.005, df = 4896, p-value = 1.48e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.14121974 -0.08592991
## sample estimates:
## cor
## -0.1136628
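This box plot shows fixed acidity at each quality score. Quality 9 appears to have higher fixed acidity than the other scores, although only 5 wines have that score. Overall, the correlation between fixed acidity and quality is weak and negative (r = -0.11).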
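The Pearson test output above can be reproduced from just the correlation and the degrees of freedom, which helps show why such a small correlation still gets a tiny p-value. This is a sketch using the standard t statistic and Fisher z interval; the variable names are illustrative.

```r
# Values copied from the cor.test() output above.
r  <- -0.1136628   # sample correlation between fixed.acidity and quality
df <- 4896         # degrees of freedom, n - 2
n  <- df + 2       # number of wines

# t statistic for H0: true correlation = 0.
t_stat  <- r * sqrt(df) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df)   # two-sided p-value

# 95% confidence interval via the Fisher z-transformation.
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

round(t_stat, 3)
signif(p_value, 3)
round(ci, 4)
r^2                # share of variance in quality explained linearly
```

The p-value is small because the sample is large, not because the relationship is strong: `r^2` is only a little over one percent.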

This plot shows chlorides at each quality score. From the box plots we can see that quality levels 7, 8, and 9 have lower chloride levels than the others.
Using ggpairs to find the relationships between all variables

The ggpairs plot is useful for finding relationships between pairs of variables based on their correlations.
We can explore the relationship between quality and alcohol using box plots. In these plots the median alcohol content rises at the higher quality scores.

Here we can see the relationship between density and alcohol: alcohol content decreases as density increases. Higher density is also associated with lower quality. The fitted linear model shows the decrease in alcohol content.
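A rough sketch of how such a line is fitted is shown below. It uses only the five density and alcohol values printed in the str() output earlier in this report, so it illustrates the mechanics of `lm()` rather than reproducing the fit in the plot; which variable the original plot puts on which axis is an assumption here.

```r
# First five rows of density and alcohol, as printed in the str() output
# earlier in this report (rounded for display, so this is illustrative only).
density <- c(1.001, 0.994, 0.995, 0.996, 0.996)
alcohol <- c(8.8, 9.5, 10.1, 9.9, 9.9)

# Ordinary least squares fit of alcohol on density (assumed orientation).
fit   <- lm(alcohol ~ density)
slope <- unname(coef(fit)["density"])  # change in alcohol per unit of density

slope < 0              # TRUE when alcohol falls as density rises
cor(density, alcohol)  # the sign matches the slope
```

Even on this tiny sample the slope is negative, in line with the downward trend described above.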
Bivariate Summary:
The first plots compare quality with fixed acidity and quality with chlorides. Acidity levels are similarly low for both good and poor quality wines, so acidity does not appear to change quality. The chloride levels point the same way, with high and low quality levels showing similar chloride content. From the ggpairs plot we can see the relationships between all the variables; of these, quality and alcohol correlate better than the others. We also rounded the pH values to one decimal place to create a transformed variable.
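The pH rounding mentioned above can be sketched as follows, using the ten pH values printed in the str() output earlier in this report. The name `ph_rounded` is an assumption; the original column name for the transformed variable is not shown.

```r
# First ten pH values, as printed in the str() output earlier in this report.
ph <- c(3.00, 3.30, 3.26, 3.19, 3.19, 3.26, 3.18, 3.00, 3.30, 3.22)

# Round to one decimal place so pH can be used as a grouping variable.
ph_rounded <- round(ph, 1)   # `ph_rounded` is an illustrative name

ph_counts <- table(ph_rounded)  # number of wines in each rounded pH group
ph_counts
```

Rounding turns a continuous measurement into a small number of groups, which is what makes it usable for coloring or splitting a plot.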
Multivariate Analysis
In this plot the x axis is quality and the y axis is alcohol, with points colored by density and the data shown by rounded pH.

In this graph we can see the relationship between alcohol content and density. As density increases, alcohol content decreases, and quality falls with it. Most of the dark blue dots are at the top.

From this scatterplot we can approximately see that good quality wine has a pH of 3 to 3.4, lower density, and an alcohol level of about 10 to 14.

This graph shows residual sugar against density with respect to quality. Residual sugar tends to decrease as quality increases, but even low quality wines often have very little residual sugar, so we cannot draw a conclusion.

From this we can see that both low and high quality wines show high pH values when plotted against residual sugar.

From this we can see that most good quality wines cluster where total.sulfur.dioxide is about 100 and density is about 0.99, with pH values as high as 3.75. This is important in analyzing wine quality.

This scatterplot is meant to help us understand the relationship between alcohol and density with respect to quality. We cannot really find a pattern here because most of the values fall in the 5 to 7 range, which makes the relationship difficult to read.

In this scatterplot we can see chloride content plotted against alcohol. This is useful for assessing the role of chlorides and pH alongside alcohol.

In this scatterplot, we can see that as quality increases, the citric acid points shift toward alcohol levels of 12 to 13.

This graph shows alcohol against sulphates with respect to quality. As quality increases, the sulphates points shift toward alcohol levels of 12 to 14.

From the analyses above we can say that good quality wine has a pH of 3 to 3.4, lower density, and a good alcohol level. Alcohol content and quality are therefore correlated. When density increases, alcohol content decreases, and quality decreases with it. Sulphate and chloride concentrations also decrease as alcohol increases. These are some of the factors I observed during the exploratory data analysis.
Final Plots and Summary:
Plot 1:

Plot 1 Analysis
From this scatterplot we can approximately see that good quality wine has a pH of 3 to 3.4, lower density, and an alcohol level of about 10 to 14.
Plot 2: 
Plot 2 Analysis
From these box plots we can clearly see how alcohol varies across quality scores. From quality 5 upward, alcohol content increases roughly linearly with quality. This is a clear indication of a positive relationship between quality and alcohol.
For quality scores 3 and 4, Plot 1 suggests that density may have affected the result. Quality may also have been affected by the concentration of chlorides or sulfur dioxide, as we saw earlier.
Plot 3: 
Plot 3 Analysis
Plot 2 showed how quality and alcohol relate to each other. From this scatterplot we can see that chlorides, taken together with alcohol, play an important role in understanding quality. Higher quality wines have chlorides of about 0.0 to 0.05 together with alcohol of about 11 to 14. When a wine falls outside this region, its quality tends to drop, so a wine would need to satisfy both conditions to score well.
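The region described above can be written as a simple filter. The sketch below applies it to the ten chlorides and alcohol values printed in the str() output earlier in this report; `in_target_region` is a hypothetical helper introduced for illustration and is not part of the original analysis.

```r
# First ten rows of chlorides and alcohol, as printed in the str() output
# earlier in this report.
chlorides <- c(0.045, 0.049, 0.050, 0.058, 0.058,
               0.050, 0.045, 0.045, 0.049, 0.044)
alcohol   <- c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)

# Hypothetical helper: TRUE when a wine sits in the region described above
# (chlorides 0 to 0.05 and alcohol 11 to 14).
in_target_region <- function(chlorides, alcohol) {
  chlorides >= 0 & chlorides <= 0.05 & alcohol >= 11 & alcohol <= 14
}

flags <- in_target_region(chlorides, alcohol)
which(flags)   # row numbers that satisfy both conditions
mean(flags)    # share of rows inside the region
```

All ten of these wines have a quality of 6, so this only shows how the filter works; checking the claim itself would mean comparing quality inside and outside the region on the full dataset.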
Reflection
For this dataset it was difficult to find relationships between variables, and it was hard to understand the chemical properties of white wines. The most important issue was that little data was available for higher quality wines: quality 8 and 9 have far fewer observations than the other scores (175 and 5 wines). Still, I tried my best to understand the data and find exploratory relationships. Since no variable is strongly linearly correlated with quality, linear regression seemed of little use.
Quality was used as the dependent variable and the others as independent variables, and the analysis was built on this approach. We did some analysis based on acidity and pH levels, but it did not produce clear results. Further work showed relationships between density and alcohol and between alcohol and quality. These relationships gave some basis for classifying wine quality, and further analysis was built on them. Low density, good alcohol content, and suitable pH and chloride concentrations together indicate good quality wine.
If more data for higher quality wines becomes available in the future, it will help us identify the key chemical features. The other missing factor is price. The price of a wine could have broadened the search for what drives quality and might have revealed further relationships, including ones useful for predicting price.
Overall, the data shows that chlorides, alcohol content, density, and pH played a crucial role in understanding quality. Since quality is my dependent variable, I hope I did my best in describing the data.